---
title: "Narrative of original analysis"
author: "Renata Diaz"
date: "5/14/2019"
output: html_notebook
---

```{r setup, include = F}
library(replicatebecs)
download_data = TRUE
source_sims = FALSE
set.seed(352) # GNV area code, for fun
storagepath = '/Users/renatadiaz/Desktop/toy-becs'
#storagepath = here::here('files')
setup_files(storagepath = storagepath)
```

A walkthrough of Ernest (2005)'s original analytical approach, from close reading of the paper. 

## Questions

1. Is energy use across body size categories (regardless of species) uniform or multimodal?
    - uniform would correspond generally to energetic equivalence/Damuth's rule.
    - multimodal might suggest different resource availability for different body sizes.
2. If energy use is not uniform across body size categories, does the species-level body size distribution correspond to modes of energy use?
    - i.e. are there more species with mean body sizes around the modes of the body size-energy use distribution?
    - if so, maybe it's good to be certain sizes, and species accumulate at those optima.


## Data

#### Ernest data
Ernest drew data from the Andrews LTER, the Sevilleta, Niwot Ridge, and Portal. For details of these field sites, see the _relevant publications cited in Ernest (2005)._ 

#### Translation to `replicate-becs`

The same datasets are available online, _from these sources_. 
```{r download data if not downloaded, include = F}
if(download_data) download_raw_paper_data(storagepath = storagepath)
```

```{r download raw data, eval = F}
download_raw_paper_data(storagepath = storagepath)
```

Process raw data into the appropriate format. This is a data table with a record for each individual and columns for `species` and `weight` in grams. By default these tables will be stored in subdirectories of `replicate-becs/data/paper/processed`. 

```{r process paper data}
process_raw_data(storagepath = storagepath)
```

Load data tables for each community. 

```{r load community data, echo=TRUE}
communities <- load_paper_data(storagepath = storagepath)
```

Each community should be a data table with columns for species and size for each individual, for example:

```{r inspect community data}
names(communities)[[1]]
head(communities[[1]])
```


#### Comparing data from 2005 to data available in 2019

Although the same datasets are now available online, they may have changed somewhat since 2005 (due to error checking, etc). Ernest (2005) may also have taken some cleaning and filtering steps, the details of which could have been omitted from the manuscript due to length restrictions. For example, many studies using data from the Portal Project omit ground squirrels, because although they may be within the "small mammal" size range, they are not target taxa for the sampling method. 

Ernest (2005) reported summary statistics of her datasets; here I compare them to the corresponding dataset in the 2019 data. 

```{r compare summary stats}
summary_stats_comparison = compare_summary_stats(storagepath = storagepath)

print(summary_stats_comparison)
```


## Constructing distributions/metrics

### Body size-energy use distributions (BSED)

#### Ernest method

- Per individual, calculate metabolic rate $B$ as $B \propto M^{\frac{3}{4}}$, where $M$ is mass in grams.
- Sum energy use of all individuals in body size classes of 0.2 natural log units.
    - Also try classes of 0.1 and 0.3 natural log units.
- Convert raw energy use values for each body size class into the proportion of that community's total energy use accounted for by that size class. This allows for comparisons between communities.
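
A minimal sketch of these steps on a made-up community. The data frame `toy_community` and the choice to start size classes at $\ln M = 0$ are illustrative assumptions, not details from Ernest (2005); the package functions used later in this notebook may define class boundaries differently.

```r
# Hypothetical toy community: one row per individual (not the paper's data).
toy_community <- data.frame(
  species = c("a", "a", "b", "b", "c"),
  weight  = c(10, 11, 40, 44, 150),  # mass in grams
  stringsAsFactors = FALSE
)

ln_units <- 0.2

# Metabolic rate up to a constant of proportionality: B = M^(3/4).
toy_community$energy <- toy_community$weight^0.75

# Size class = index of the ln(mass) bin of width ln_units (assumed origin at 0).
toy_community$size_class <- floor(log(toy_community$weight) / ln_units)

# Sum energy within each size class, then convert to proportions of the total.
toy_bsed <- aggregate(energy ~ size_class, data = toy_community, FUN = sum)
toy_bsed$prop_energy <- toy_bsed$energy / sum(toy_bsed$energy)

toy_bsed
```

Because every community's proportions sum to 1, the resulting distributions can be compared across communities with very different total abundance.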


![Ernest 2005 Fig 1](../analysis/ernest2005_fig1.png)


#### Translation to `replicate-becs`

For every individual, calculate metabolic rate and assign to a size class. 

```{r construct BSEDs}
communities_energy <- lapply(communities, FUN = make_community_table, ln_units = 0.2)

head(communities_energy[[1]])
```

For each community, sum total energy use for each size class, and convert to the proportion of total energy use for that community.

```{r make bseds}
bseds <- lapply(communities_energy, FUN = make_bsed)

head(bseds[[1]])
```

```{r plot bseds, echo=FALSE, fig.height=10, fig.width=10}

bseds_plot <- plot_paper_dists(bseds, dist_type = 'bsed')

invisible(bseds_plot)
```

### Species-level body size distributions (BSD)

#### Ernest method
- Frequency distributions of mean mass of each species in a community.
- For plotting (but not statistics), smoothed using kernel density estimation. 
    - Gaussian kernel to mimic the actual body size distribution in log space
    - smoothing parameter $h$ = average standard deviation of the mean of the logged masses
    - align sampling points with the midpoint of each size class in the BSED
    - after Manly 1996, "Are there clumps in body-size distributions?", _Ecology_
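
A rough sketch of the smoothing step on made-up data. It assumes one reading of the smoothing parameter (the mean, across species, of the standard deviation of logged individual masses) and size classes starting at $\ln M = 0$; both are assumptions for illustration, and `toy` is hypothetical data.

```r
# Hypothetical individuals from three species (not the paper's data).
toy <- data.frame(
  species = rep(c("a", "b", "c"), each = 3),
  weight  = c(9, 10, 12, 38, 42, 47, 140, 150, 165),
  stringsAsFactors = FALSE
)

# Log of each species' mean mass: the values being smoothed.
ln_mean_mass <- as.numeric(log(tapply(toy$weight, toy$species, mean)))

# Assumed smoothing parameter: average within-species sd of logged masses.
h <- mean(tapply(log(toy$weight), toy$species, sd))

# Evaluate the density at the midpoint of each 0.2 ln-unit size class.
ln_units <- 0.2
classes <- seq(floor(min(ln_mean_mass) / ln_units), floor(max(ln_mean_mass) / ln_units))
midpoints <- (classes + 0.5) * ln_units

# Gaussian kernel density estimate: average of normal densities centred on each species.
kde <- vapply(
  midpoints,
  function(p) mean(dnorm(p, mean = ln_mean_mass, sd = h)),
  numeric(1)
)

data.frame(size_class = classes, midpoint = midpoints, density = kde)
```

Evaluating the kernel directly at the class midpoints, rather than on the default grid of `stats::density()`, keeps the smoothed BSD aligned with the BSED size classes.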

#### Translation to `replicate-becs`

Calculate mean mass of each species in each community. 

```{r construct bsds} 

bsds <- lapply(communities, FUN = make_bsd) 

head(bsds[[1]])
```

```{r plot bsds, echo=FALSE, fig.height=10, fig.width=10}

bsds_plot <- plot_paper_dists(bsds, dist_type = 'bsd')

invisible(bsds_plot)
```


### Energetic dominance ($D_E$)

- Define "energy use modes" as contiguous body size classes where the energy use of each size class is > 5% of the community total. 
    - i.e. a little bit more than the expectation if energy use is uniform across all body sizes
- Calculate the total energy use for each species in the mode. 
- Calculate the "dominance" of the species with the highest energy use in that mode as $D_E = p_{max}$, where $p_{max}$ is the maximum proportion of energy use by any one species in a mode. 
    - "a modification of the Berger-Parker dominance index (Berger and Parker 1970)"
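
A minimal sketch of the mode-finding and dominance calculation on made-up data. It assumes that $p_{max}$ is a proportion of the energy used within the mode (not of the community total) and that empty size classes break contiguity; the table `toy` and its values are hypothetical.

```r
# Hypothetical per-individual table with size class and energy already computed.
toy <- data.frame(
  species    = c("a", "a", "b", "c", "c", "d"),
  size_class = c(11, 12, 12, 20, 20, 16),
  energy     = c(30, 20, 10, 25, 14, 1),
  stringsAsFactors = FALSE
)

# Proportion of community energy in every size class, including empty ones.
classes <- seq(min(toy$size_class), max(toy$size_class))
class_energy <- vapply(
  classes,
  function(k) sum(toy$energy[toy$size_class == k]),
  numeric(1)
)
class_prop <- class_energy / sum(toy$energy)

# Label runs of adjacent classes; keep runs where every class exceeds 5%.
above <- class_prop > 0.05
runs <- rle(above)
run_id <- rep(seq_along(runs$values), runs$lengths)
mode_ids <- unique(run_id[above])

# For each mode, D_E = share of the mode's energy used by its top species.
toy_dominance <- do.call(rbind, lapply(mode_ids, function(m) {
  mode_classes <- classes[run_id == m]
  in_mode <- toy$size_class %in% mode_classes
  species_energy <- tapply(toy$energy[in_mode], toy$species[in_mode], sum)
  data.frame(
    min_class = min(mode_classes),
    max_class = max(mode_classes),
    d_e = max(species_energy) / sum(species_energy)
  )
}))

toy_dominance
```

In this toy example, classes 11 and 12 form one mode shared by species `a` and `b`, and class 20 forms a second mode held entirely by species `c`.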

#### Translation to `replicate-becs`

- Find contiguous size classes where each class has >5% of total energy use
- Within each mode, calculate the total energy use for each species, and the proportion held by the species with the highest energy use ($p_{max}$)
- Return $p_{max}$ for every mode, along with the min and max size classes in that mode for each community

```{r energetic dominance}

energetic_dom <- lapply(communities_energy, FUN = energetic_dominance) 

head(energetic_dom[[1]])

```

- To plot, combine all modes from all communities and plot a histogram of $D_E$ values.

```{r plot Ed, echo=FALSE, fig.height=5, fig.width=5}
e_dom_plot <- plot_e_dom(energetic_dom)
e_dom_plot
```


## Statistical tests

### Comparing BSEDs to uniform

#### Ernest approach

- Use bootstrap sampling to compare to uniform distributions.
- For every community, draw 10000 samples (sim communities):
    - Same number of individuals as the empirical community, drawn from a uniform distribution ranging from the smallest to largest ~~body size~~ individual metabolic rate of any individual in that community.
- For sim communities and the empirical community, calculate a distribution overlap index ($DOI$):
    - $DOI = \sum_k {|y_{ak} - y_{bk}|}$ where $y$ is the value for size class $k$ in communities $a$ and $b$.
    - $DOI$ values will range from 0 (complete overlap) to 2 (no overlap). 
    - For the BSED bootstraps, community $a$ is the empirical or sim distribution, and community $b$ is a true uniform distribution ~~(i.e. $y_{bk} = \frac{1}{\max(k)}$ for all $k$)~~
        - "True uniform distribution": There are exactly the same number of individuals of every size. 
- Calculate the $DOI$ for all sim communities and the empirical.
- Locate the empirical $DOI$ within the distribution of sim $DOI$s. The p-value is the proportion of sim uniform communities with $DOI$s greater than the empirical $DOI$.
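
A simplified sketch of the $DOI$ and the bootstrap p-value. To stay short it bins individuals by size on the raw scale and compares proportions of individuals per bin, not proportions of energy use, so it illustrates the logic only. The helper `bin_props()`, the data `toy_sizes`, the ten equal-width bins, and the 1000 draws are all assumptions for illustration.

```r
# Distribution overlap index between two sets of proportions over the same classes.
doi <- function(y_a, y_b) sum(abs(y_a - y_b))

# Proportion of values falling in each bin defined by `breaks` (hypothetical helper).
bin_props <- function(x, breaks) {
  idx <- findInterval(x, breaks, all.inside = TRUE)
  tabulate(idx, nbins = length(breaks) - 1) / length(x)
}

set.seed(1)
# Hypothetical bimodal community of 100 individuals.
toy_sizes <- c(rnorm(60, mean = 20, sd = 2), rnorm(40, mean = 60, sd = 5))
breaks <- seq(min(toy_sizes), max(toy_sizes), length.out = 11)

# "True uniform": exactly one individual at every size from min to max, by 0.1.
true_uniform <- bin_props(seq(min(toy_sizes), max(toy_sizes), by = 0.1), breaks)

empirical_doi <- doi(bin_props(toy_sizes, breaks), true_uniform)

# Sim communities: same n, sizes drawn uniformly between the observed min and max.
sim_dois <- replicate(1000, {
  sim <- runif(length(toy_sizes), min = min(toy_sizes), max = max(toy_sizes))
  doi(bin_props(sim, breaks), true_uniform)
})

# Proportion of sim DOIs greater than the empirical DOI.
p_value <- mean(sim_dois > empirical_doi)
p_value
```

A small p-value here means that few uniformly drawn communities depart from the true uniform distribution as much as the empirical community does.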

#### Translation to `replicate-becs`

- For a given empirical community, draw 10000 sim communities each with the same number of individuals $n$, with body sizes randomly drawn from a uniform distribution from the minimum to maximum body size in that community.
- Calculate the $DOI$ of each sim community compared to a true uniform distribution. 
    - True uniform distribution = every size from the minimum to the maximum size in the community (in steps of 0.1 g) has exactly one individual.

```{r BSED-uniform bootstrapped DOIs, eval = F}

bsed_uniform_bootstraps <- lapply(communities, FUN = community_bootstrap,  bootstrap_function = 'bootstrap_unif_bsed_doi', nbootstraps = 10000)

```

```{r source or draw BSED uniform bootstraps, include = F}
if(source_sims) {
  load(file.path(storagepath, 'data', 'sims', 'bsed_uniform_bootstraps.Rdata'))
} else {
  
bsed_uniform_bootstraps <- lapply(communities, FUN = community_bootstrap,  bootstrap_function = 'bootstrap_unif_bsed_doi', nbootstraps = 10)

save(bsed_uniform_bootstraps, file = (file.path(storagepath, 'data', 'sims', 'bsed_uniform_bootstraps.Rdata')))
}
```


_See issue #4 on GitHub._


```{r plot BSED-uniform bootstrap DOIs v empirical,echo=FALSE, fig.height=10, fig.width=10} 

bsed_uniform_bootstrap_plot <- plot_paper_dists(bsed_uniform_bootstraps, dist_type = 'bsed_bootstraps')

invisible(bsed_uniform_bootstrap_plot)
```

### Compare BSEDs among communities

#### Ernest approach
- For every pair of communities, create a pool of masses of all individuals from both communities.
- Draw two new communities with the same number of individuals as the empirical communities, pulling masses at random from the pool, with replacement.
- Calculate the DOI for the BSEDs of the two sample communities.
- Repeat 10000 times for each pair.
- The p-value is the proportion of sample DOIs greater (i.e. less overlap) than the empirical value. 
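
A minimal sketch of this pooled resampling for one pair of communities. The helper `toy_bsed()`, the lognormal toy masses, the size-class origin at $\ln M = 0$, and the 1000 draws are assumptions for illustration, not the package implementation.

```r
# Distribution overlap index between two sets of proportions over the same classes.
doi <- function(y_a, y_b) sum(abs(y_a - y_b))

# Simplified BSED (hypothetical helper): share of total energy (M^0.75) per ln-mass class.
toy_bsed <- function(mass, classes, ln_units = 0.2) {
  size_class <- floor(log(mass) / ln_units)
  energy <- mass^0.75
  class_energy <- vapply(
    classes,
    function(k) sum(energy[size_class == k]),
    numeric(1)
  )
  class_energy / sum(energy)
}

set.seed(2)
mass_a <- rlnorm(80, meanlog = 3, sdlog = 0.4)  # hypothetical community a
mass_b <- rlnorm(50, meanlog = 4, sdlog = 0.4)  # hypothetical community b

# Pool all individuals and fix one set of size classes for every comparison.
pool <- c(mass_a, mass_b)
classes <- seq(floor(log(min(pool)) / 0.2), floor(log(max(pool)) / 0.2))

empirical_doi <- doi(toy_bsed(mass_a, classes), toy_bsed(mass_b, classes))

# Resample two communities of the original sizes from the pool, with replacement.
sim_dois <- replicate(1000, {
  sim_a <- sample(pool, size = length(mass_a), replace = TRUE)
  sim_b <- sample(pool, size = length(mass_b), replace = TRUE)
  doi(toy_bsed(sim_a, classes), toy_bsed(sim_b, classes))
})

# Proportion of resampled DOIs greater than the empirical DOI.
p_value <- mean(sim_dois > empirical_doi)
p_value
```

Using a single set of size classes spanning the pooled range keeps the two BSED vectors the same length, which the $DOI$ sum requires.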

#### Translation to `replicate-becs`
- For every pair of communities, pool all the masses
- Resample two communities of the right sizes
- Construct BSEDs for both communities
- Calculate the DOI of the two BSEDs
- Repeat 10000x

```{r pairs for crosscommunity BSED comparisons}
# All unique pairs among the 9 communities, one row per pair.
community_combination_indices = utils::combn(x = c(1:9), m = 2, simplify = TRUE) %>%
  t() %>%
  as.data.frame() %>%
  dplyr::rename(community_a = V1, community_b = V2)

# Bundle the two community tables and their names for one pair of indices.
combine_communities = function(indices, communities) {
  community_combination = list(
    community_a = communities[[indices[1]]],
    community_b = communities[[indices[2]]],
    community_names = c(names(communities)[[indices[1]]], names(communities)[[indices[2]]])
  )
  
  return(community_combination)
}

community_combinations = apply(community_combination_indices, MARGIN = 1, FUN = combine_communities, communities = communities)

```

```{r cross community BSED comparisons, eval =F}

bsed_crosscomm_bootstraps = lapply(community_combinations, FUN = community_bootstrap, 
                                   bootstrap_function = 'bootstrap_crosscomm_bseds', nbootstraps = 10000)


```


```{r source or draw cross community BSED, include = F}
if(source_sims) {
  load(file.path(storagepath, 'data', 'sims', 'bsed_crosscomm_bootstraps.Rdata'))
} else {
  
bsed_crosscomm_bootstraps = lapply(community_combinations, FUN = community_bootstrap, 
                                   bootstrap_function = 'bootstrap_crosscomm_bseds', nbootstraps = 10)

save(bsed_crosscomm_bootstraps, file = (file.path(storagepath, 'data', 'sims', 'bsed_crosscomm_bootstraps.Rdata')))
}
```

```{r plot cross community comparisons, echo = F, fig.height=30, fig.width=10}

crosscomm_bootstrap_plot = plot_crosscomm_bseds(bsed_crosscomm_bootstraps)

invisible(crosscomm_bootstrap_plot)

```

```{r plot cross community p values, echo = F, fig.height = 5, fig.width = 5}
pvals_histogram = plot_bootstrap_pvals(bsed_crosscomm_bootstraps)

pvals_histogram
```


The histogram of p-values across all pairwise comparisons shows whether communities' BSEDs are the same or different.

### Testing BSDs for uniformity

#### Ernest approach
- $\delta$-corrected Kolmogorov-Smirnov test. 
- "The $\delta$-corrected K-S test increases the power of the test when sample sizes are small (n < 25; Zar 1999)"
- The $\delta$-corrected test is not widely discussed online. 


```{r ernest d ks test results}
ernest_key =  read.csv(file.path(storagepath, 'ernest-2005-files', 'ernest_key.csv'), stringsAsFactors = F)
ernest_bsds_uniform_results = read.csv(file.path(storagepath, 'ernest-2005-files', 'ernest_appendixA.csv'), stringsAsFactors = F) %>%
  dplyr::left_join(ernest_key, by = 'site')
print(ernest_bsds_uniform_results)
```


#### Translation to `replicate-becs`


*From Zar (1999) _Biostatistical Analysis_.*

##### Base K-S test
- Take vector of measurements $X_i$. 
- For each $X_i$, record the observed frequency $f_i$ (number of observations with that value).
- Determine cumulative observed frequencies $F_i$ and cumulative relative frequencies $\textrm{rel}F_i$:
    - $\textrm{rel}F_i = \frac{F_i}{n}$ where $n$ is the number of measurements taken. 
    - $\textrm{rel}F_i$ is the proportion of the sample that is measurements $\leq X_i$. 
- For each $X_i$, determine the cumulative *relative* expected frequency from the comparison distribution, $\textrm{rel}\hat{F_i}$.
    - For a uniform distribution, $\textrm{rel}\hat{F_i} = \frac{X_i - \min(X)}{\max(X) - \min(X)}$
- Determine $D_i$ and $D'_i$ as:
    - $D_i = |\textrm{rel}F_i - \textrm{rel}\hat{F_i}|$
    - $D'_i = |\textrm{rel}F_{i-1} - \textrm{rel}\hat{F_i}|$
    - note $\textrm{rel}F_0 = 0$, so $D'_1 = \textrm{rel}\hat{F_1}$
- The test statistic $D$ is:
    - $D = \max[\max_i(D_i), \max_i(D'_i)]$
- Compare to critical values from the appendix to Zar (1999).
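
A minimal sketch of the statistic on a made-up sample with no ties, so that $F_i = i$. The vector `x` is hypothetical. The result is cross-checked against the statistic from `stats::ks.test()` for a uniform distribution on the observed range; the critical-value lookup from Zar's tables is not reproduced here.

```r
# Hypothetical sorted sample with no tied values.
x <- sort(c(2.1, 3.4, 3.9, 5.0, 7.2, 9.8))
n <- length(x)

# Cumulative relative observed frequencies: rel F_i = F_i / n.
rel_F <- seq_len(n) / n

# Cumulative relative expected frequencies under a uniform on [min(x), max(x)].
rel_F_hat <- (x - min(x)) / (max(x) - min(x))

# D_i uses rel F_i; D'_i uses rel F_(i-1), with rel F_0 = 0.
D_i <- abs(rel_F - rel_F_hat)
D_prime_i <- abs(c(0, rel_F[-n]) - rel_F_hat)

# Test statistic: the larger of the two maxima.
D <- max(max(D_i), max(D_prime_i))

# Cross-check against the one-sample K-S statistic from base R.
D_check <- unname(ks.test(x, "punif", min = min(x), max = max(x))$statistic)

c(D = D, D_check = D_check)
```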


##### $\delta$-corrected KS test

- For small sample sizes ($n < 25$) we can obtain increased power using the $\delta$-corrected KS test.
- For each $i$ determine
    - $\textrm{rel}G_i = \frac{F_i}{n + 1}$
    - $\textrm{rel}G'_i = \frac{F_i - 1}{n - 1}$
- Then obtain similar $D$s
    - $D_{0, i} = |\textrm{rel}G_i - \textrm{rel}\hat{F_i}|$
    - $D_{1, i} = |\textrm{rel}G'_i - \textrm{rel}\hat{F_i}|$
- The test statistic is either $\max(D_{0, i})$ or $\max(D_{1, i})$, whichever leads to the highest level of significance (smallest probability). Look up significance in the table from the appendix. The subscripts 0 and 1 are the values of $\delta$. 
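
A minimal sketch of the two candidate statistics, using a hypothetical sample `x` with no ties so that $F_i = i$. Choosing between $D_0$ and $D_1$ depends on Zar's tables of critical values, which are not reproduced here, so the sketch stops at the two maxima.

```r
# Hypothetical sorted sample with no tied values.
x <- sort(c(2.1, 3.4, 3.9, 5.0, 7.2, 9.8))
n <- length(x)

# Cumulative observed frequencies (no ties, so F_i = i).
F_i <- seq_len(n)

# Cumulative relative expected frequencies under a uniform on [min(x), max(x)].
rel_F_hat <- (x - min(x)) / (max(x) - min(x))

# delta = 0 and delta = 1 versions of the observed relative frequencies.
rel_G <- F_i / (n + 1)
rel_G_prime <- (F_i - 1) / (n - 1)

# Candidate test statistics: maxima of D_(0,i) and D_(1,i).
D_0 <- max(abs(rel_G - rel_F_hat))
D_1 <- max(abs(rel_G_prime - rel_F_hat))

c(D_0 = D_0, D_1 = D_1)
```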


Tables of critical values were entered by hand from the appendix to Zar (1999).

```{r bsds deltaks to uniform, echo = F}
bsd_ks_test_raw = lapply(bsds, FUN = zar_ks_test, delta_correction = T,
                         focal_column = 'species_mean_mass',
                         expected_range = NULL,
                         n_or_i = 'n',
                         storagepath = storagepath)

bsd_ks_test_raw_results = data.frame(
  community_name = as.character(names(bsd_ks_test_raw)),
  signif = vapply(bsd_ks_test_raw, FUN = extract_values_zarks, val_name = "signif", FUN.VALUE = TRUE),
  p_max = vapply(bsd_ks_test_raw, FUN = extract_values_zarks, val_name = "p_max", FUN.VALUE = .1), 
  p_min = vapply(bsd_ks_test_raw, FUN = extract_values_zarks, val_name = "p_min", FUN.VALUE = .1),
  d_statistic = vapply(bsd_ks_test_raw, FUN = extract_values_zarks, val_name = "d", FUN.VALUE = .1),
  stringsAsFactors = F
) %>% 
  dplyr::left_join(ernest_bsds_uniform_results, by = 'community_name')

print(bsd_ks_test_raw_results)

```

The $\delta$-corrected KS test does not reproduce the results from Ernest (2005) when the species mean body size values are on an untransformed scale. 

Using the natural log of the species mean body size values instead:

```{r bsds deltaks to uniform log, echo =F}
bsd_ks_test_log = lapply(bsds, FUN = zar_ks_test, delta_correction = T,
                         focal_column = 'ln_mass',
                         expected_range = NULL,
                         n_or_i = 'n',
                         storagepath = storagepath)

bsd_ks_test_log_results = data.frame(
  community_name = as.character(names(bsd_ks_test_log)),
  signif = vapply(bsd_ks_test_log, FUN = extract_values_zarks, val_name = "signif", FUN.VALUE = TRUE),
  p_max = vapply(bsd_ks_test_log, FUN = extract_values_zarks, val_name = "p_max", FUN.VALUE = .1), 
  p_min = vapply(bsd_ks_test_log, FUN = extract_values_zarks, val_name = "p_min", FUN.VALUE = .1),
  d_statistic = vapply(bsd_ks_test_log, FUN = extract_values_zarks, val_name = "d", FUN.VALUE = .1),
  stringsAsFactors = F
) %>% 
  dplyr::left_join(ernest_bsds_uniform_results, by = 'community_name')

print(bsd_ks_test_log_results)
```

With mean mass logged, all the results replicate qualitatively (i.e. not significantly different from uniform). Niwot, for which the currently available data most closely match those reported in Ernest (2005), replicates almost exactly numerically. 

### Comparing BSDs among communities

#### Ernest approach

Ernest (2005) used a two-sample Kolmogorov-Smirnov test to compare every possible combination of community-level BSDs. 
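
For reference, a two-sample Kolmogorov-Smirnov test on two made-up BSDs using base R's `stats::ks.test()`. The vectors `bsd_a` and `bsd_b` are hypothetical species mean masses, and whether Ernest (2005) tested raw or logged masses is not assumed here (the two-sample statistic is the same either way, because the log is monotonic).

```r
# Hypothetical species mean masses (g) for two communities, with no ties.
bsd_a <- c(8, 15, 22, 40, 41, 120)
bsd_b <- c(10, 18, 25, 33, 90)

# Two-sample K-S test: D is the largest gap between the two empirical CDFs.
ks_ab <- ks.test(bsd_a, bsd_b)

c(D = unname(ks_ab$statistic), p_value = ks_ab$p.value)
```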

```{r ernest ks two sample results, echo =F}
appendix_b = tidy_appendix_b(storagepath = storagepath) %>%
  dplyr::left_join(ernest_key, by = c('site_a' = 'site')) %>%
  dplyr::rename(community_a = community_name) %>%
  dplyr::left_join(ernest_key, by = c('site_b' = 'site')) %>%
  dplyr::rename(community_b = community_name)

print(appendix_b)

```



#### Translation to `replicate-becs`

```{r ks two sample}

# use same community combinations as before

bsd_crosscomm_ks = lapply(community_combinations, FUN = ks_bsd, 
                          ln_mass_vals = F)

```

```{r print ks two sample, echo = F}
bsd_crosscomm_results = data.frame(
  community_a = vapply(bsd_crosscomm_ks, FUN = extract_values_tsks, val_name = "community_a", FUN.VALUE = 'portal'),
  community_b = vapply(bsd_crosscomm_ks, FUN = extract_values_tsks, val_name = "community_b", FUN.VALUE = 'portal'),
  ks_d =  vapply(bsd_crosscomm_ks, FUN = extract_values_tsks, val_name = "statistic", FUN.VALUE = .5),
  p_value = vapply(bsd_crosscomm_ks, FUN = extract_values_tsks, val_name = "p.value", FUN.VALUE = .5),
  stringsAsFactors = F
)


bsd_crosscomm_joined = bsd_crosscomm_results %>%
  dplyr::left_join(appendix_b, by = c('community_a', 'community_b')) %>%
  na.omit()

bsd_crosscomm_joined2 = bsd_crosscomm_results %>%
  dplyr::rename(community_b = community_a, community_a = community_b) %>%
  dplyr::left_join(appendix_b, by = c('community_a', 'community_b')) %>%
  na.omit() %>%
  dplyr::bind_rows(bsd_crosscomm_joined)

print(bsd_crosscomm_joined2)

```

```{r plot bsd crosscomm pval hist, echo =F, fig.height=5, fig.width=10} 
bsd_pvals_histogram = plot_crosscomm_ks_pvals(bsd_crosscomm_joined2)

invisible(bsd_pvals_histogram)
```

